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Considerations for Benchmarking Virtual Network Functions 
and Their Infrastructure 


Abstract 
The Benchmarking Methodology Working Group has traditionally 
conducted laboratory characterization of dedicated physical 
implementations of internetworking functions. This memo investigates 
additional considerations when network functions are virtualized and 
performed in general-purpose hardware. 
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1. Introduction 


The Benchmarking Methodology Working Group (BMWG) has traditionally 
conducted laboratory characterization of dedicated physical 
implementations of internetworking functions (or physical network 
functions (PNFs)). The black-box benchmarks of throughput, latency, 
forwarding rates, and others have served our industry for many years. 
[RFC1242] and [RFC2544] are the cornerstones of the work. 


A set of service provider and vendor development goals has emerged: 
reduce costs while increasing flexibility of network devices and 
drastically reduce deployment time. Network Function Virtualization 
(NFV) has the promise to achieve these goals and therefore has 
garnered much attention. It now seems certain that some network 
functions will be virtualized following the success of cloud 
computing and virtual desktops supported by sufficient network path 
capacity, performance, and widespread deployment; many of the same 
techniques will help achieve NFV. 


In the context of Virtual Network Functions (VNFs), the supporting 
Infrastructure requires general-purpose computing systems, storage 
systems, networking systems, virtualization support systems (such as 
hypervisors), and management systems for the virtual and physical 
resources. There will be many potential suppliers of Infrastructure 
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systems and significant flexibility in configuring the systems for 
best performance. There are also many potential suppliers of VNFs, 
adding to the combinations possible in this environment. The 
separation of hardware and software suppliers has a profound 
implication on benchmarking activities: much more of the internal 
configuration of the black-box Device Under Test (DUT) must now be 
specified and reported with the results, to foster both repeatability 
and comparison testing at a later time. 


Consider the following user story as further background and 
motivation: 


I’m designing and building my NFV Infrastructure platform. The 
first steps were easy because I had a small number of categories 
of VNFs to support and the VNF vendor gave hardware 
recommendations that I followed. Now I need to deploy more VNFs 
from new vendors, and there are different hardware 
recommendations. How well will the new VNFs perform on my 
existing hardware? Which among several new VNFs in a given 
category are most efficient in terms of capacity they deliver? 
And, when I operate multiple categories of VNFs (and PNFs) 
*concurrently* on a hardware platform such that they share 
resources, what are the new performance limits, and what are the 
software design choices I can make to optimize my chosen hardware 
platform? Conversely, what hardware platform upgrades should I 
pursue to increase the capacity of these concurrently operating 
VNF'S? 


See <http://www.etsi.org/technologies-clusters/technologies/nfv> for 
more background; the white papers there may be a useful starting 
place. The "NFV Performance & Portability Best Practices" document 
[NFV.PEROO1] is particularly relevant to BMWG. There are also 
documents available among the Approved ETSI NFV Specifications 
[Approved_ETSI_NFV], including documents describing Infrastructure 
performance aspects and service quality metrics, and drafts in the 
ETSI NFV Open Area [Draft_ETSI_NFV], which may also have relevance to 
benchmarking. 


1.1. Requirements Language 


The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", “SHALL NOT", 
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and 
"OPTIONAL" in this document are to be interpreted as described in BCP 
14 [RFC2119] [RFC8174] when, and only when, they appear in all 
capitals, as shown here. 
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2. Scope 


At the time of this writing, BMWG is considering the new topic of 
Virtual Network Functions and related Infrastructure to ensure that 
common issues are recognized from the start; background materials 
from respective standards development organizations and Open Source 
development projects (e.g., IETF, ETSI NFV, and the Open Platform for 
Network Function Virtualization (OPNFV)) are being used. 


This memo investigates additional methodological considerations 
necessary when benchmarking VNFs instantiated and hosted in general- 
purpose hardware, using bare metal hypervisors [BareMetal] or other 
isolation environments such as Linux containers. An essential 
consideration is benchmarking physical and Virtual Network Functions 
in the same way when possible, thereby allowing direct comparison. 
Benchmarking combinations of physical and virtual devices and 
functions in a System Under Test (SUT) is another topic of keen 
interest. 


A clearly related goal is investigating benchmarks for the capacity 
of a general-purpose platform to host a plurality of VNF instances. 
Existing networking technology benchmarks will also be considered for 
adaptation to NFV and closely associated technologies. 


A non-goal is any overlap with traditional computer benchmark 
development and their specific metrics (e.g., SPECmark suites such as 
SPEC CPU). 


A continued non-goal is any form of architecture development related 
to NFV and associated technologies in BMWG, consistent with all 
chartered work since BMWG began in 1989. 


3. Considerations for Hardware and Testing 
This section lists the new considerations that must be addressed to 
benchmark VNF(s) and their supporting Infrastructure. The SUT is 
composed of the hardware platform components, the VNFs installed, and 
many other supporting systems. It is critical to document all 
aspects of the SUT to foster repeatability. 


3.1. Hardware Components 


The following new hardware components will become part of the test 
setup: 


1. High-volume server platforms (general-purpose, possibly with 
virtual technology enhancements) 
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2. Storage systems with large capacity, high speed, and high 
reliability 


3. Network interface ports specially designed for efficient service 
of many virtual Network Interface Cards (NICs) 


4. High-capacity Ethernet switches 
The components above are subjects for development of specialized 
benchmarks that focus on the special demands of network function 
deployment. 
Labs conducting comparisons of different VNFs may be able to use the 
same hardware platform over many studies, until the steady march of 
innovations overtakes their capabilities (as happens with the lab’s 
traffic generation and testing devices today). 

3.2. Configuration Parameters 
It will be necessary to configure and document the settings for the 
entire general-purpose platform to ensure repeatability and foster 
future comparisons, including, but clearly not limited to, the 
following: 
o number of server blades (shelf occupation) 
o CPUs 
o caches 
o memory 
o storage system 


o I/O 


as well as configurations that support the devices that host the VNF 
itself: 


o Hypervisor (or other forms of virtual function hosting) 
o Virtual Machine (VM) 
o Infrastructure virtual network (which interconnects virtual 


machines with physical network interfaces or with each other 
through virtual switches, for example) 
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and finally, the VNF itself, with items such as: 
o specific function being implemented in VNF 


o reserved resources for each function (e.g., CPU pinning and Non- 
Uniform Memory Access (NUMA) node assignment) 


o number of VNFs (or sub-VNF components, each with its own VM) in 
the service function chain (see Section 1.1 of [RFC7498] for a 
definition of service function chain) 


o number of physical interfaces and links transited in the service 
function chain 


In the physical device benchmarking context, most of the 
corresponding Infrastructure configuration choices were determined by 
the vendor. Although the platform itself is now one of the 
configuration variables, it is important to maintain emphasis on the 
networking benchmarks and capture the platform variables as input 
factors. 


3.3. Testing Strategies 


The concept of characterizing performance at capacity limits may 
change. For example: 


1. It may be more representative of system capacity to characterize 
the case where the VMs hosting the VNFs are operating at 50% 
utilization and therefore sharing the "real" processing power 
across many VMs. 


2. Another important test case stems from the need to partition (or 
isolate) network functions. A noisy neighbor (VM hosting a VNF 
in an infinite loop) would ideally be isolated; the performance 
of other VMs would continue according to their specifications, 
and tests would evaluate the degree of isolation. 


3. System errors will likely occur as transients, implying a 
distribution of performance characteristics with a long tail 
(like latency) and leading to the need for longer-term tests of 
each set of configuration and test parameters. 


4. The desire for elasticity and flexibility among network functions 
will include tests where there is constant flux in the number of 
VM instances, the resources the VMs require, and the setup/ 
teardown of network paths that support VM connectivity. Requests 
for and instantiation of new VMs, along with releases for VMs 
hosting VNFs that are no longer needed, would be a normal 
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operational condition. In other words, benchmarking should 
include scenarios with production life-cycle management of VMs 
and their VNFs and network connectivity in progress, including 
VNF scaling up/down operations, as well as static configurations. 


5. All physical things can fail, and benchmarking efforts can also 
examine recovery aided by the virtual architecture with different 
approaches to resiliency. 


6. The sheer number of test conditions and configuration 
combinations encourage increased efficiency, including automated 
testing arrangements, combination sub-sampling through an 
understanding of inter-relationships, and machine-readable test 
results. 


3.4. Attention to Shared Resources 


Since many components of the new NFV Infrastructure are virtual, test 
setup design must have prior knowledge of interactions/dependencies 
within the various resource domains in the SUT. For example, a 
virtual machine performing the role of a traditional tester function, 
such as generating and/or receiving traffic, should avoid sharing any 
SUT resources with the DUT. Otherwise, the results will have 
unexpected dependencies not encountered in physical device 
benchmarking. 


Note that the term "tester" has traditionally referred to devices 
dedicated to testing in BMWG literature. In this new context, 
"tester" additionally refers to functions dedicated to testing, which 
may be either virtual or physical. "Tester" has never referred to 
the individuals performing the tests. 


The possibility to use shared resources in test design while 
producing useful results remains one of the critical challenges to 
overcome. Benchmarking setups may designate isolated resources for 
the DUT and other critical support components (such as the host/ 
kernel) as the first baseline step and add other loading processes. 
The added complexity of each setup leads to shared-resource testing 
scenarios, where the characteristics of the competing load (in terms 
of memory, storage, and CPU utilization) will directly affect the 
benchmarking results (and variability of the results), but the 
results should reconcile with the baseline. 


The physical test device remains a solid foundation to compare with 

results using combinations of physical and virtual test functions or 
results using only virtual testers when necessary to assess virtual 

interfaces and other virtual functions. 


Morton Informational [Page 7] 


RFC 8172 Benchmarking VNFs and Related Infrastructure July 2017 


4. Benchmarking Considerations 


This section discusses considerations related to benchmarks 
applicable to VNFs and their associated technologies. 


4.1. Comparison with Physical Network Functions 


In order to compare the performance of VNFs and system 
implementations with their physical counterparts, identical 
benchmarks must be used. Since BMWG has already developed 
specifications for many network functions, there will be re-use of 
existing benchmarks through references, while allowing for the 
possibility of benchmark curation during development of new 
methodologies. Consideration should be given to quantifying the 
number of parallel VNFs required to achieve comparable scale/capacity 
with a given physical device or whether some limit of scale was 
reached before the VNFs could achieve the comparable level. Again, 
implementation based on different hypervisors or other virtual 
function hosting remain as critical factors in performance 
assessment. 


4.2. Continued Emphasis on Black-Box Benchmarks 


When the network functions under test are based on open-source code, 
there may be a tendency to rely on internal measurements to some 
extent, especially when the externally observable phenomena only 
support an inference of internal events (such as routing protocol 
convergence observed in the data plane). Examples include CPU/Core 
utilization, network utilization, storage utilization, and memory 
committed/used. These "white-box" metrics provide one view of the 
resource footprint of a VNF. Note that the resource utilization 
metrics do not easily match the 3x4 Matrix, described in Section 4.4. 


However, external observations remain essential as the basis for 
benchmarks. Internal observations with fixed specification and 
interpretation may be provided in parallel (as auxiliary metrics), to 
assist the development of operations procedures when the technology 
is deployed, for example. Internal metrics and measurements from 
open-source implementations may be the only direct source of 
performance results in a desired dimension, but corroborating 
external observations are still required to assure the integrity of 
measurement discipline was maintained for all reported results. 


A related aspect of benchmark development is where the scope includes 
multiple approaches to a common function under the same benchmark. 
For example, there are many ways to arrange for activation of a 
network path between interface points, and the activation times can 
be compared if the start-to-stop activation interval has a generic 
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and unambiguous definition. Thus, generic benchmark definitions are 
preferred over technology/protocol-specific definitions where 
possible. 


4.3. New Benchmarks and Related Metrics 


There will be new classes of benchmarks needed for network design and 
assistance when developing operational practices (possibly automated 
management and orchestration of deployment scale). Examples follow 
in the paragraphs below, many of which are prompted by the goals of 
increased elasticity and flexibility of the network functions, along 
with reduced deployment times. 


o Time to deploy VNFs: In cases where the general-purpose hardware 
is already deployed and ready for service, it is valuable to know 
the response time when a management system is tasked with 
"standing up" 100s of virtual machines and the VNFs they will 
host. 


o Time to migrate VNFs: In cases where a rack or shelf of hardware 
must be removed from active service, it is valuable to know the 
response time when a management system is tasked with "migrating" 
some number of virtual machines and the VNFs they currently host 
to alternate hardware that will remain in service. 


o Time to create a virtual network in the general-purpose 
Infrastructure: This is a somewhat simplified version of existing 
benchmarks for convergence time, in that the process is initiated 
by a request from (centralized or distributed) control, rather 
than inferred from network events (link failure). The successful 
response time would remain dependent on data-plane observations to 
confirm that the network is ready to perform. 


o Effect of verification measurements on performance: A complete 
VNF, or something as simple as a new policy to implement in a VNF, 
is implemented. The action to verify instantiation of the VNF or 
policy could affect performance during normal operation. 


Also, it appears to be valuable to measure traditional packet 
transfer performance metrics during the assessment of traditional and 
new benchmarks, including metrics that may be used to support service 
engineering such as the spatial composition metrics found in 
[RFC6049]. Examples include mean one-way delay in Section 4.1 of 
[RFC6049], Packet Delay Variation (PDV) in [RFC5481], and Packet 
Reordering [RFC4737] [RFC4689]. 
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4.4. Assessment of Benchmark Coverage 


It can be useful to organize benchmarks according to their applicable 
life-cycle stage and the performance criteria they were designed to 
assess. The table below (derived from [X3.102]) provides a way to 
organize benchmarks such that there is a clear indication of coverage 
for the intersection of life-cycle stages and performance criteria. 


For example, the "Time to deploy VNFs" benchmark described above 
would be placed in the intersection of Activation and Speed, making 
it clear that there are other potential performance criteria to 
benchmark, such as the "percentage of unsuccessful VM/VNF stand-ups" 
in a set of 100 attempts. This example emphasizes that the 
Activation and De-activation life-cycle stages are key areas for NFV 
and related Infrastructure and encourages expansion beyond 
traditional benchmarks for normal operation. Thus, reviewing the 
benchmark coverage using this table (sometimes called the 3x3 Matrix) 
can be a worthwhile exercise in BMWG. 


In one of the first applications of the 3x3 Matrix in BMWG 
[SDN-BENCHMARK], we discovered that metrics on measured size, 
capacity, or scale do not easily match one of the three columns 
above. Following discussion, this was resolved in two ways: 


o Add a column, Scale, for use when categorizing and assessing the 


coverage of benchmarks (without measured results). An example of 
this use is found in [OPNFV-BENCHMARK] (and a variation may be 
found in [SDN-BENCHMARK]). This is the 3x4 Matrix. 
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o If using the matrix to report results in an organized way, keep 
size, capacity, and scale metrics separate from the 3x3 Matrix and 
incorporate them in the report with other qualifications of the 
results. 


Note that the resource utilization (e.g., CPU) metrics do not fit in 
the matrix. They are not benchmarks, and omitting them confirms 
their status as auxiliary metrics. Resource assignments are 
configuration parameters, and these are reported separately. 


This approach encourages use of the 3x3 Matrix to organize reports of 
results, where the capacity at which the various metrics were 
measured could be included in the title of the matrix (and results 
for multiple capacities would result in separate 3x3 Matrices, if 
there were sufficient measurements/results to organize in that way). 


For example, results for each VM and VNF could appear in the 3x3 
Matrix, organized to illustrate resource occupation (CPU Cores) ina 
particular physical computing system, as shown below. 


VNF #1 
{|__|__|__ |__| 
Core 1 Fey mena Fee | 
fe, eee rae | 
Po Wot le oy 
Ge Sse ei eaten cat ee L 
VNF #2 
SA 
Cores 2-5 {| __|__ |__| 
Fe ee pee Se 
Weekes Fh. ed 
Ds aie EET sche ot Cs E v 
VNF#3 VNF #4 VNF#5 
TE (aes Pe eee (eat | Rees ee eae 
Core 6 Feet foes ES Fee {__|__|__|__| sme ae Fone pa 
Fe pees Pe a eae FES te iene CF 
a ta? Pee ee alle =] W Gea Vas hae | 
T SR ee RS DP Pee ee Pe F | EE e E EEE v Se ay ess beets ecient 2 r 
VNF #6 
Core 7 een ee es 
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The combination of tables above could be built incrementally, 
beginning with VNF#1 and one Core, then adding VNFs according to 
their supporting Core assignments. X-Y plots of critical benchmarks 
would also provide insight to the effect of increased hardware 
utilization. All VNFs might be of the same type, or to match a 
production environment, there could be VNFs of multiple types and 


categories. In this figure, VNFs #3-#5 are assumed to require small 
CPU resources, while VNF#2 requires four Cores to perform its 
function. 

4.5. Power Consumption 


Although there is incomplete work to benchmark physical network 
function power consumption in a meaningful way, the desire to measure 
the physical Infrastructure supporting the virtual functions only 
adds to the need. Both maximum power consumption and dynamic power 
consumption (with varying load) would be useful. The Intelligent 
Platform Management Interface (IPMI) standard [IPMI2.0] has been 
implemented by many manufacturers and supports measurement of 
instantaneous energy consumption. 


To assess the instantaneous energy consumption of virtual resources, 
it may be possible to estimate the value using an overall metric 
based on utilization readings, according to [NFVIaas—-FRAMEWORK]. 


5. Security Considerations 


Benchmarking activities as described in this memo are limited to 
technology characterization of a DUT/SUT using controlled stimuli in 
a laboratory environment, with dedicated address space and the 
constraints specified in the sections above. 


The benchmarking network topology will be an independent test setup 
and MUST NOT be connected to devices that may forward the test 
traffic into a production network or misroute traffic to the test 
management network. 


Further, benchmarking is performed on a "black-box" basis, relying 
solely on measurements observable external to the DUT/SUT. 


Special capabilities SHOULD NOT exist in the DUT/SUT specifically for 
benchmarking purposes. Any implications for network security arising 
from the DUT/SUT SHOULD be identical in the lab and in production 
networks. 
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6. IANA Considerations 

This document does not require any IANA actions. 
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